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Abstract. This paper presents a new approach, based on polynomial opti¬ 
mization and the method of moments, to the problem of anomaly detection. 
The proposed technique only requires information about the statistical mo¬ 
ments of the normal-state distribution of the features of interest and compares 
favorably with existing approaches (such as Parzen windows and 1-class SVM). 
In addition, it provides a succinct description of the normal state. Thus, it 
leads to a substantial simplification of the the anomaly detection problem when 
working with higher dimensional datasets. 


1. Introduction 

Detecting anomalies in data is a task that engineers and scientists of every dis¬ 
cipline encounter on a fairly regular basis. Common approaches to identifying 
anomalous observations include: estimating the probability density function of the 
“normal” state using density-estimation methods, computing the Mahalanobis dis¬ 
tance between a point and a sample distribution, using machine-learning methods 
to learn the normal state and perform 1-class classification, using “whiteness” tests 
to detect when the differences between new samples and predictions (of linear mod¬ 
els) become statistically colored, etc. However, a disadvantage encountered while 
estimating the distribution of the normal state is that one never knows if the avail¬ 
able samples are representative of all the normal behaviors. This is especially true 
when there is a dearth of data with which to construct a reliable probability density 
estimate. In this paper, we address this challenge through the use of recent results 
from the method of moments and polynomial optimization. The main idea is to 
use information about the statistical moments of the normal-state distribution of 
a given set of features to compute an upper-bound on the probability of a given 
observation of a random variable with this distribution. The observation is then 
deemed to be anomalous when this bound falls below a given threshold. While in 
principle computing this bound is a challenging infinite-dimensional problem, as 
shown in the sequel, the use of concepts from the theory of moments allows for 
recasting it into a computationally tractable finite dimensional optimization. 

The generalized moment problem (GMP), defined below, has tremendous mod¬ 
eling power and has found use in a wide range of applications in areas such as 
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probability, financial economics, engineering, and operations research m- How¬ 
ever, its wider application has been held back because, in its fullest generality, the 
GMP is intractable [9]. On the other hand, recent advances in algebraic geometry 
have rendered the GMP tractable when the problem instance is described using 
polynomials mm 

Formally, the GMP is given by: 

fodn 

subject to f K fjd/i ^ 7 j, j = 1 ,..., m 

where K is a Borel subset of M n , .A4(K) is the space of finite signed Borel measures 
on K, .A4(IK) + is the positive cone containing finite Borel measures p on K, { 7 j \ j = 
1 ,..., m} is a set of real numbers, and fj : K -A M are integrable with respect to 
all fi G .M(IK) + for every j = 0,... , m. The ^ stands for either an inequality or 
equality constraint. 

Lasserre Eunnann] showed that semidefinite programming (SDP) can be used 
to obtain an (often) finite sequence of relaxations which can be solved efficiently 
and which converge from above to the optimal solution when the problem has 
polynomial data; thereby giving us a tool with which to attempt to solve very 
difficult (i.e., non-convex) medium-sized problems, given the present state of SDP 
solvers. The monotonic convergence means that the i-th relaxation is useful, even 
when optimality has not been achieved, by providing the user with an upper (i.e. 
optimistic) bound pi to p m0 m in 0. 

In the sequel, we will go over the preliminaries of the moments approach, present 
the aforementioned Lasserre relaxations, and demonstrate how to use them to detect 
anomalous data samples. 


(i) 


GMP: Pmom = sup 

fieX(K)+ JK 


/ 

. Jk 


2. Preliminaries 

For a real symmetric matrix A, A > 0 means A is positive semidefinite. Let 
d G N = {0,1,2,...}, s(d ) = ( n ^ d ), and let v G be the column vector 

containing the canonical polynomial basis monomials 

(2) v(x) = (l,Xi,X 2 , ... ,x n ,x\,xix 2 ,.. .,x 1 x n ,x 2 x 3 , ...,xl,...,xf,..., ,x^) T 

for polynomials of the form p(x) : M n -A M with dimension s(d) and degree at 
most d. M[x] denotes the set of polynomials in x G M n = (aq, X 2 ,.. ., x n ) with real 
coefficients. 

With a G N n = (aq,... ,<a n ), let x a = (x^ 1 , ... ,x^ n ) and y = {y a } denote a 
finite sequence of real variables. For a polynomial p = with coefficients 

a 

p a , let L y (p) denote the linear map 

( 3 ) Ly(p) = PaVa 

a 

which associates p with the moment variables, y a . 

2.1. Moment and Localizing Matrices. The d -th order moment matrix Md(y) 
is indexed by up-to order-d monomial coefficients a,/3 G N n . That is, each entry of 
the s(d) x s(d) matrix is given by 

(4) M d (y)(a,f3) = y a+p , a,/3 G N n , |a|, |/?| < d 
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n 

where |a| = For example, in the case d = 2 and n = 2, if a = (0,1) 

and f3 = (2,0), then M 2 (y)(a, /3) = 2/21 • Usually the (a,/3) is suppressed (since it is 
understood that the first row and column of Md(y ) contains the elements of L y (v)). 
The entire M 2 (y) matrix is given below for illustration. 


( 5 ) 


M 2 (y) 


y 00 

y 10 

2/01 

2/20 

2/11 

2/02 

2/10 

2/20 

2/ii 

2/30 

2/21 

2/12 

2/01 

yn 

2/02 

2/21 

2/12 

2/03 

2/20 

2/30 

2/21 

2/40 

2/31 

2/22 

yn 

2/21 

2/12 

2/31 

2/22 

2/13 

yo2 

2/12 

2/03 

2/22 

2/13 

2/04 


Thus, it is clear that Md(y ) = L y (u(t)u(t) t ). 

The classical problem of moments is concerned with determining if, for a given 
moment sequence { y a }, there is a measure y so that y a = f x a dy for each a. If 
so, we say y has a representing measure which is called determinate if it is unique. 
Loosely speaking, checking existence of such a measure involves testing if Md(y ) is 
positive semidefinite. Readers interested in the technical details are referred to El 
and references therein. 

For a given polynomial g G M[x], the localizing matrix Md(gy) is given by 


(6) M d (gy) = L y (g(x)v(x)v(x) T ) 


This is 

equivalent to shifting 

M d {y) by 

the monomials of <7. 

For example, 

with 

g(x) = ~ 

a — x‘ 

2 

1 - x 2 7 

we 

have 













2/oo 

2/io 

2/oi 

2/20 

2/11 

2/02 


2/20 

2/30 

2/21 

2/40 

2/31 

2/22 



2/io 

2/20 

2/n 

2/30 

2/21 

2/12 


2/30 

2/40 

2/31 

2/50 

2/41 

2/32 

M 2 (gy) 

= a 

2/oi 

2/n 

2/02 

2/21 

2/12 

2/03 

_ 

2/21 

2/31 

2/22 

2/41 

2/32 

2/23 



2/20 

2/30 

2/21 

2/40 

2/31 

2/22 


2/40 

2/50 

2/41 

2/60 

2/51 

2/42 



2/n 

2/21 

2/12 

2/31 

2/22 

2/13 


2/31 

2/41 

2/32 

2/51 

2/42 

2/33 



2/02 

2/12 

2/03 

2/22 

2/13 

2/04 


2/22 

2/32 

2/23 

2/42 

2/33 

2/24 


( 7 ) 


2/02 

2/12 

2/03 

2/22 

2/13 

2/04 

2/12 

2/22 

2/13 

2/32 

2/23 

2/14 

2/03 

2/13 

2/04 

2/23 

2/14 

2/05 

2/22 

2/32 

2/23 

2/42 

2/33 

2/24 

2/13 

2/23 

2/14 

2/33 

2/24 

2/15 

2/04 

2/14 

2/05 

2/24 

2/15 

2/06 


Localizing matrices are used to specify support constraints on /i, as described in 
the next section. 


2.2. Upper-bounds Over Semi-algebraic Sets. The material in this subsec¬ 
tion comes from nn and [I] which discuss the problem of finding bounds on the 
probability that some random variable x belongs to a set S C M n , given some of 
its moments 

Suppose S is of the form 

S = {x £ R n | fj(x) >0, j = 1,... ,m} 


(8) 
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where fj are given polynomial constraints. In terms of the GMP, this problem can 
be expressed as 


( 9 ) 


Pmom — Slip 

AieVW(K) + 


/ 

Jr 


lsdfi 


s.t. / x a dp = 7 j, j = 1 ,... ,ra 

J R n 


where Is is the indicator function of the set S. To solve this problem using SDP 
methods, Lasserre provided the following relaxations 


( 10 ) 


s.t. 


i y < 


Pi = sup 2/0 

y,z 

Hot + = If a j G T 

Mi(y)> 0 
Mi(z) > 0 

Mi- Vj (fjy) >0, j ra 


where fj = |"deg(/j)/ 2 ] and T is the set of indices that correspond to the known 
moment information. 

The following result assures us that pi approaches the optimal value p mom from 
above and tells us when optimality has been reached. This result is included for 
completeness as we will generally not require an optimal solution. This is an advan¬ 
tage because we do not have to worry about extracting optimizer^] from a moment 
matrix that has satisfied the rank condition in Theorem [lj 


Theorem 1. m Let pi be the optimal value of the semidefinite program P 5 . Then: 

(a) : For every i > v, pi > p m om and moreover, pi | p mom as i 00 . 

(b) : If pi is attained at an optimal solution (y,z) which satisfies 


/ -1 -1 \ / Rank Mi (y) = Rank Mi- V (y) 

1 j \ Rank Mi{z) = Rank 

then Pi = Pmom • 

2.3. Lower Bounds. To compute lower bounds, one can use the complement of S, 
instead of S', in and subtract pi from im- However, in the approach presented 
here, the lower bounds do not provide much information because the sets S that 
we will work with are small compared to the range of the data, which means the 
lower bound will usually be zero. 

3. Robust Anomaly Detection 

One approach to anomaly detection is to estimate the probability density func¬ 
tion (PDF) and select a threshold below which incoming samples are classified as 
anomalous. Another is to train a 1-class support vector machine (SVM) and use 
it to classify samples. Here, we will compare against both of these standard tech¬ 
niques. To quantify performance, we will use the threshold-independent receiver 
operating characteristic (ROC) curves and area under curve (AUC) metric; this is 

^There is (currently) no reliable method of extracting optimizers from a moment matrix. 
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a standard way of assessing binary classifier performance. Both ROC and AUC 
are readily obtained using the MATLAB command perf curve. The kernel density 
estimates are obtained using the MATLAB command ksdensity (and the KDE 
toolbox for higher dimensional examples [6]). For the 1-class SVM, we use the 
MATLAB command f itcsvm with the automatic tuning- and “standardized data” 
options enabled. 

We would like to emphasize to the reader, however, that our goal is not to 
estimate the probability density function. Our goal is simpler: to use the upper- 
bound on the probability of the observation to decide whether or not it is anomalous. 
To use for anomaly detection we select r > 0 to specify a neighborhood S which 
contains an incoming measurement. The neighborhood can be a sphere or a bo^] 
and is specified in the constraints fj in P s- The r can be selected as a function of 
the measurement noise, or otherwise small compared to the expected range of the 
data. In fact, r can be set to zero if one wishes. In practice, we would like to use r 
to account for measurement noise. 


3.1. Moment Estimation and Data Whitening. A key advantage of the pro¬ 
posed method is that the information from a data set enters through the moment 
estimates which are computed using Equation (12). 


( 12 ) 


la 


1 N 
- y 

N ^ 


G T 


i=1 


It is known that it is difficult to estimate higher order moments since whatever 
noise is present in the data will be raised to higher powers as well. In view of Equa¬ 
tion (12), one can see that if the data is poorly scaled or biased in some coordinate, 
7 a will increase quickly in those directions and will therefore result in fewer avail¬ 
able moments for those directions and may cause numerical problems for the SDP 
solver. Thus, it is sometimes useful to use the data whitening technique discussed 
in pQ to transform the data so that it has zero mean and unit variance. This may 
be useful in lower-dimensional instances, where it is possible to use these higher 
order moments without the size of the moment matrices exceeding computational 
resources. 

Data whitening is a simple procedure that involves subtracting the mean of 
the data and multiplying by a transformation matrix, in the multivariate case, or 
dividing by the standard deviation in the univariate case. Incoming samples are 
then processed the same way, using the stored mean and transformation, before 
obtaining an upper-bound probability estimate. 


4. Examples 

In this section we present four example^} The first is a ID example that shows 
that our approach compares favorably with traditional kernel probability density 
based anomaly detectors and with the 1-class SVM. We will demonstrate: that 
higher-order moments contain useful information and that this information helps 
best the other two approaches especially when there are fewer data points to learn 
from and when the dimensionality of the data increases. These points will become 


2 S in P $ can be any set of the form in jsj. 

3 We used MATLAB R2013b/R2014a, CVX 2], and SeDuMi |12| to solve our examples. 
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apparent through the first three examples, which increase in dimension. The final 
example shows how our method can be used to recover contours from a noisy binary 
image. 


4.1. Example 1: A Bimodal Distribution. Consider the bimodal distribution 
(3] in Figure [l] This distribution poses a challenge to density estimation tools like 
Parzen windows because the two modes are estimated best using different window 
sizes. 

Figure [l] shows the distribution and the test points, which consist of 300 inliers 
and 50 outliers. 



Figure 1 . Bimodal distribution 


Table [l] shows the AUC values for our experiments, which use r = 0.001 for 
the moments classifier. The reader can see that AUC generally increases as more 
moments are used and that the performance is better with fewer points. The 
difference, however, is small because this is a ID example. 


Table 1 . AUC Summary 


N 

SVM 

Parzen 

m 2 

m 4 

M 6 

M 8 

50 

0.9821 

0.9686 

0.8701 

0.9715 

0.9940 

0.9929 

100 

0.9907 

0.9807 

0.8457 

0.9882 

0.9984 

0.9957 

300 

0.9953 

0.9781 

0.8268 

0.9783 

0.9951 

0.9945 
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Figure 2 . ROC curves when 50 points are used for training 



Figure 3. ROC curves when 300 points are used for training 

4.2. Example 2: 2D Distribution. For higher dimensional datasets, the com¬ 
plexity of estimating PDFs increases greatly. Indeed, the arguably most popular 
engineering software tool, MATLAB, does not have a density estimation function 
for multi-dimensional datasets. 
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Consider the distribution below 

(13) x\ = 2 cos(1.5£) + r]i 

x 2 = 4sinc(0.9£) + r] 2 

with ££[ 7 t / 4 , 7r], Tji^U[— 0.05, 0.05], and ?72~^[0, 0.02]. This distribution has a “P” 
shape and the closed portion presents a problem for the kernel and SVM approaches. 

Figure [4] shows the probability surface and the 350 test points which include 50 
outliers. 



Figure 4. 2D distribution and test points 


Table [2] shows the results of our experiments, using r = 0.001 for the moments 
classifier. Again, the reader can see how the performance increases as more moment 
information is included except this time the difference in performance is greater 
because the data has another dimension. This trend will continue in the third 
example. 


Table 2 . AUC Summary 


N 

SVM 

Parzen 

m 2 

M 3 

m 4 

M 5 

100 

0.9252 

0.7843 

0.6330 

0.9722 

0.9883 

0.9916 

150 

0.9437 

0.8045 

0.6575 

0.9751 

0.9903 

0.9947 

300 

0.9435 

0.8335 

0.6436 

0.9705 

0.9863 

0.9933 
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Figure 5. ROC curves when 100 points are used for training 



Figure 6. ROC curves when 300 points are used for training 

4.3. Example 3: Swiss roll Distribution. The following distribution is the well- 
known “Swiss roll” distribution. The following equations were used to create the 
test and training samples. 
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(14) 

xi = y 1 cos(y 1 ) + 

(15) 

x 2 = yi sin ( 2 / 1 ) + rj 2 

(16) 

x 3 = V 2 + m 

with Vl ~ 

, y 2 ~ U[ 1 , 2 ], r] 1 ~ V(0,0.8 2 ), ~ A/”(0,0.3 2 ), and 773 ~ 


AT(0,0.25 2 ). Figure 0 shows the test points which consist of 300 inliers and 50 
outliers. 



Figure 7. Swiss-roll distribution 


Table [ 3 ] shows the familiar trend of increasing AUC with additional moments and 
excellent performance even with just 100 training samples. This example illustrates 
how the other two standard techniques degrade as dimensionality increases. As 
with the previous examples, r = 0.001 was used for the moments classifier and the 
recommended automatic tuning options were used for other techniques. 


Table 3. AUC Summary 


N 

SVM 

Parzen 

m 2 

m 4 

M 6 

100 

0.7887 

0.8487 

0.8099 

0.9194 

0.9228 

300 

0.8001 

0.9085 

0.8201 

0.9480 

0.9903 
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False Positive Rate 


Figure 8 . ROC curves when 100 points are used for training 



Figure 9. ROC curves when 300 points are used for training 

4.4. Example 4: Contour Detection. The following example is included to 
show the potential of this method to be used in image processing applications. The 
contour image is from E3HI- In this example, we attempt to recover a contour 
corrupted by 5% noise. We calculated the Haralick energy, in two directions and 
























































































12 


J. A. LOPEZ, O. CAMPS, AND M. SZNAIER 


three scales, for (9x9) neighborhoods of each pixel of the clean image, computed 7 ^, 
and the upper-bound probability surface, M 4 , using the noisy image with r = 0.05. 
The three resulting probability images were then averaged. Since there are many 
more white pixels, the contour pixels should have a lower probability upper-bound. 
Further, as the Haralick energy is the sum of the squared elements of the co¬ 
occurrence matrix, computed for each 9x9 window, we also expect the noisy pixels 
to have higher probability than the contour pixels. The reader can see that this is 
the case, as shown in the left subfigure of Figure [Tl] After inverting, thresholding, 
and post-processing with morphological operations, we obtained the contour shown 
on the right of Figure [Tl] again, using just the information contained in 



Figure 10. Clean and noisy images 



Figure 11. Upper-bound image and recovered contour 
5. Conclusions 

This paper illustrates the ability of very recent developments in polynomial opti¬ 
mization and generalized moment problems to provide an elegant, computationally 
tractable, solution to anomaly detection problems. The proposed approach uses 
information from a set of moments of a probability distribution to directly provide 
an upper-bound p on the probability of observing a particular sample (modulo mea¬ 
surement noise), without explicitly modeling the distribution. A salient feature of 
this technique is that all the data, whether obtained using 1 or 1000 observations, 
enter the problem through a finite sequence of moments, that can be updated as 
more data is collected, without affecting the computational complexity (since new 
observations affect the value, but not the size of the moment matrices). Further¬ 
more, the detectors obtained by solving reached peak performance with fewer 
samples than the kernel and SVM methods. 
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Our examples implicitly assume that the observations contain enough informa¬ 
tion to make good decisions, that is, we have assumed that the samples are good fea¬ 
tures. When anomalous observations are available, it may be possible to construct 
optimal features to separate classes - in the same way support (or relevance) vector 
machine methods find separating curves - by maximizing the distance between the 
moment vectors of the two classes. We think this is an interesting direction for 
future research. 
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